Analyze the adhesin properties of the Hyr/Iff-like (Hil) family in the Sacchromycetes This is version 4 of the analysis, using the expanded blast hits on 2022-05
The goal of the first analysis is to assess the evidence for each Hil family member as encoding yeast adhesins. We rely on a ML-based prediction algorithm and the following known sequence features of adhesins: signal peptide + GPI-anchor, tandem repeats, high S/T frequency (possibly glycosylation)
First get the basic information about the sequences in this study.
#sps.list <- c("Cduobushaemulonis","Cpseudohaemulonis","Chaemuloni","Cauris","Clusitaniae","Dhansenii","Cparapsilosis","Lelongisporus","Ctropicalis","Cdubliniensis","Calbicans","Sstipitis","Klactis","Ncastellii","Cglabrata","Nbracarensis","Ndelphensis","Nnivariensis")
blastInfo <- read_tsv("../data/expanded-blast-homologs-info.tsv", col_types = cols())# %>%
# mutate(species_id = factor(species, levels = sps.list), species = NULL)
FungalRV is a Support Vector Machine (SVM) based prediction algorithms that use sequence features such as amino acid composition (frequency, physiochemical properties etc.) as input and train Machine Learning models to distinguish fungal adhesins from non-adhesins.
frv.th = 0.511 # recommended FungalRV score threshold
frv <- read_tsv("../output/FungalRV/fungalRV-results.txt", skip = 3, col_names = c("name","frv.score"), col_types = "cd") %>%
mutate(name = str_sub(name, 2), frv.pred = frv.score > frv.th)
#if("frv.score" %in% names(seqInfo))
# seqInfo <- select(seqInfo, -frv.score, -frv.pred, -faa.score, -faa.pred)
#seqInfo <- seqInfo %>% left_join(frv) %>% left_join(faa)
GPI-anchored proteins are characterized by an N-terminal signal peptide, which would direct the protein to the secretary pathway, and a C-terminal GPI-anchor peptide, which would be cleaved and replaced by the GPI-anchor, allowing the protein to be tethered to the cell wall.
For signal peptide, I used the SignalP server. Its latest version is 6.0.
# Signal peptide
gff.names <- c("name", "source", "type", "start", "end", "prob", "na1", "na2", "na3")
signalp6 <- read_tsv("../output/web-download/signalp_6.0_result.gff3", comment = "#", col_names = gff.names, col_types = "ccciidccc")
#if("signalp" %in% names(seqInfo))
# seqInfo <- select(seqInfo, -signalp)
#
#seqInfo <- left_join(seqInfo, select(signalp5, name = id, prob), by = c("name" = "name")) %>%
# mutate(signalp = !is.na(prob)) %>% select(-prob)
For GPI-anchor prediction, I used the PredGPI server.
tmp <- read_delim("../output/web-download/predgpi-result-headline-only.txt", delim = "|", col_names = c("name","fp","omega"))
Rows: 215 Columns: 3── Column specification ───────────────────────────────────────────────────────────────────────
Delimiter: "|"
chr (3): name, fp, omega
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pred.gpi <- tmp %>%
mutate(name = str_sub(name,2,-2), # remove > and the trailing space
fp = as.numeric(str_sub(fp, 9, -2)), # extract the numeric part
gpi.pred = fp <= 0.01, # based on the cutoff of the PredGPI server (prob < 99% -> not GPI-anchored)
omega = str_sub(omega, 8),
cleaveRes = str_sub(omega, 1, 1),
cleavePos = as.integer(str_sub(omega, 3))
)
# remove the column if it already exists
#if("pred.gpi" %in% names(seqInfo))
# seqInfo <- select(seqInfo, -pred.gpi)
#seqInfo <- left_join(seqInfo, select(pred.gpi, name, pred.gpi = is.gpi), by = c("name"="name"))
adhesin <- blastInfo %>%
select(name, species, len = slen, pComplete) %>%
left_join(frv, by = "name") %>%
left_join(signalp6 %>% select(name, sp.prob = prob), by = "name") %>%
left_join(pred.gpi %>% select(name, gpi.pred, cleavePos), by = "name") %>%
mutate(sp.pred = !is.na(sp.prob)) %>%
relocate(c(frv.pred, sp.pred, gpi.pred), .after = pComplete)
Export the adhesin summary table
write_tsv(adhesin, file = "../output/table/20220818-Hil-family-size-adhesin-status-summary.tsv")
Plot the results alongside the species tree. First import the species tree
spsInfo <- read_tsv("../data/20220518-expanded-blast-species-info.tsv", col_types = cols())
sps.tree <- read.tree("../data/20220521-generax-species-tree.nwk") %>%
as_tibble() %>%
mutate(label = gsub("_", " ", label)) %>%
left_join(spsInfo, by = c("label" = "species")) %>%
as.treedata()
# to label the clades
clade <- c(
MDR = MRCA(sps.tree, "Candida auris", "Candida duobushaemulonis"),
CaLo = MRCA(sps.tree, "Candida parapsilosis", "Candida tropicalis"),
glabrata = MRCA(sps.tree, "Candida glabrata", "Candida nivariensis")
)
sps.tree <- groupClade(sps.tree, clade)
Because we will plot the results separately, it is important to generate the species tree object and extract the order of the species, so as to match the numbers to the species.
p.tree <- ggtree(sps.tree, ladderize = FALSE) + xlim(0,2.2) + scale_y_reverse() +
#geom_tiplab(aes(color = pathogen), as_ylab = TRUE) +
geom_tiplab(size = 3.2, fontface = "italic", align = TRUE, linesize = 0.1, offset = 0.05) +
geom_treescale(x = 0, width = 0.2, linesize = 1.2) +
geom_hilight(node = clade["MDR"], fill = "#7F00FF", alpha = 0.15) + # MDR
geom_hilight(node = clade["CaLo"], fill = "pink", alpha = 0.25) + # Candida/Lodderomyces
geom_hilight(node = clade["glabrata"], fill = "steelblue", alpha = 0.15) + # glabrata
#geom_cladelabel(node = clade["MDR"], label = "MDR", offset = 1.5,# color = "purple",
# offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# MDR
#geom_cladelabel(node = clade["CaLo"], label = "Candida/\nLodderomyces", offset = 1.47,# color = "hotpink2",
# offset.text = 0.1, angle = 270, hjust = .5, extend = 0.5, fontsize = 3.5) + # albicans
#geom_cladelabel(node = clade["glabrata"], label = "glabrata", offset = 1.38,# color = "steelblue",
# offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +# glabrata
geom_tippoint(aes(color = pathogen)) +
scale_color_manual(values = c("crustacean" = "#6a5acd",
"human" = "#d14949",
"human (rare)" = "steelblue",
"no report" = "gray20")) +
#guides(color = guide_legend(byrow = TRUE)) +
theme(legend.position = c(0.12, 0.13))
Scale for 'y' is already present. Adding another scale for 'y', which will replace the
existing scale.
Summarize the results
df0 <- adhesin %>%
mutate(species = factor(species, levels = rev(get_taxa_name(p.tree)))) %>%
group_by(species) %>%
summarize(total = n(), FRV = sum(frv.pred), SP = sum(sp.pred), GPI = sum(gpi.pred),
final = sum(frv.pred & sp.pred & gpi.pred)) %>%
pivot_longer(total:final, names_to = "type", values_to = "number") %>%
mutate(type = factor(type, levels = unique(type))) %>%
# complete missing values for S. cerevisiae
complete(species, type, fill = list(number = 0))
p <- ggplot(df0, aes(x = type, y = species)) +
scale_y_discrete(limits = rev) +
geom_tile(aes(fill = number), color = "white", alpha = 0.4) +
geom_text(aes(label = number), color = "black", size = 4) +
scale_fill_distiller(palette = "Greys", direction = 1, limits = c(0, 20), oob = scales::squish) +
scale_x_discrete(position = "top") +
theme_cowplot() + theme(axis.title = element_blank(), axis.line = element_blank(), legend.position = "none")
ggsave(p, file = paste0("../output/img/",gsub("-", "", Sys.Date()), "-species-adhesin-prediction-summary.png"), width = 5, height = 7)
p1 <- p + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
plot_grid(p.tree, p1, rel_widths = c(3,2), scale = c(1, 0.95))
Summarize the % of Hil genes passing all three tests
df0 %>% pivot_wider(names_from = "type", values_from = "number") %>%
select(species, total, final) %>%
summarize(total = sum(total), adhesin = sum(final))
The goal is to produce a schematic plot for each homolog outlining their main features, such as the locations of the PFam domains (mainly the Hyp_reg_CWP), locations of the signal peptide and GPI-anchor, distribution of TANGO sequences. Note that all these features can be represented as a range with associated metadata. So the first step is to collect the coordinates of the features
p.gene.tree <- ggtree(gene.tree, size = 0.4) + xlim(0,4) +
#geom_nodepoint(aes(fill = D), data = td_filter(D == "Y"), shape = 21, size = 1, color = "red") +
geom_tippoint(aes(color = group), size = 1) +
geom_tiplab(aes(color = family),
align = TRUE, linesize = 0.1, size = 1, offset = 0.05) +
#geom_cladelab(node = 357, label = "M. bicuspidata", offset.text = 0.05, angle = 270, hjust = .5, extend = 0.5) +
#geom_text(label = "D", data = td_filter(D == "Y"), hjust = -0.4, size = 1.5, color = "red")# +
scale_color_manual(name = "Clade", values = clade.cols)# +
p.gene.tree
ggsave(filename = "../output/img/20220916-gene-tree-side.png", width = 5, height = 7)
Extract gene tree order
genetreeOrder <- get_taxa_name(p.gene.tree)
# summarize stats of tandem repeats
repeats <- tandem %>%
group_by(type, period) %>%
summarize(n = n(), copyMean = mean(copyN), .groups = "drop") %>%
mutate(length = period * copyMean)
tr <- tandem %>%
left_join(select(repeats, type, copyMean), by = c("type" = "type")) %>%
filter(name %in% genetreeOrder) %>%
mutate(type = paste0("TR-", type),
tip = paste0(consensus_nogap,
"\ntype: ", type,
"\nperiod: ", period,
"\ncopyN: ", copyMean),
name = ordered(name, levels = rev(genetreeOrder))) %>%
select(name, type, start, end, tip)
Make a seqLen object to draw the overall length of each
protein
seqLen <- blastInfo %>%
mutate(start = 1, name = ordered(name, levels = rev(genetreeOrder))) %>%
select(name, start, end = slen)
# GPI-anchor
# use pred.gpi
# feature set
# structure: id feature start end
feature <- bind_rows(
Hyphal_reg_CWP = pf11765,
# extend the signal peptide segment by 10 amino acids to make it more visible
`Signal Peptide` = signalp6 %>% mutate(end = end + 10) %>% select(name, start, end),
# extend the GPI-anchor C-terminus segment by 20 amino acids to make it more visible
`GPI-anchor` = pred.gpi %>% filter(gpi.pred) %>%
left_join(select(blastInfo, name, slen), by = "name") %>%
mutate(start = cleavePos-10, end = slen) %>%
select(name, start, end),
`Tandem Repeats` = tr %>% select(name, start, end),
.id = "type"
) %>% filter(!name %in% Mb.rm$name)
feature <- feature %>%
mutate(name = ordered(name, levels = rev(genetreeOrder)),
type = ordered(type, levels = c("Hyphal_reg_CWP", "Signal Peptide", "GPI-anchor", "Tandem Repeats")))
feature.colors <- c("Hyphal_reg_CWP" = "#3d85c6", "Signal Peptide" = "#cc0000", "GPI-anchor" = "#6a3d9a", "Tandem Repeats" = "#af8400bb")
h = 1.2
# plot
p0 <- ggplot(seqLen, aes(x = name, y = start, xend = name, yend = end)) +
geom_segment(color = "gray80", size = h)
p1 <- geom_segment(data = feature, aes(color = type), size = h)
p2 <- geom_segment(data = tandem.div, aes(x = name, xend = name, y = div, yend = div + 1.5),
size = h, color = "white")
p3 <- p0 + p1 + coord_flip() + theme_classic() +
scale_color_manual(values = feature.colors) +
scale_y_continuous(expand = expansion(mult = c(0.02, 0.02))) +
theme(axis.text.y = element_text(size = 3), axis.title.y = element_blank(),
axis.line.y = element_blank(), axis.ticks.y = element_blank(),
axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
legend.position = c(0.8, 0.9),
legend.text = element_text(size = 12), legend.title = element_text(size = 14),
plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = alpha("lightblue",0.1))) +
labs(y = "Position in sequence", x = "Sequences", color = "FEATURES")
#plot_grid(p.gtree, p3, ncol = 2, align = 'v', rel_widths = c(1,2), scale = c(1.01,1))
ggsave("../output/img/20220916-domain-tandem-repeats.png", plot = p3, width = 6, height = 7.5)
Fig. ? Blue boxes indicate the PF11765 domains while all other non-grey boxes indicate XSTREAM-determined tandem repeat domains. Colors are used to group highly similar tandem repeats. The black thin lines demarcate adjacent tandem repeat units. The table below shows the copy number, period and consensus sequence for each tandem domain organized by the host sequences.
#DT::datatable(
# tandem %>%
# dplyr::rename(seqL = seqLength, err = consensus_error, seq = consensus_nogap) %>%
# select(-seqAlign, -type, -consensus_gap, -seq, seq) %>%
# arrange(desc(name)),
# fillContainer = FALSE, options = list(pageLength = 10)
#)
Repeat the above analysis but distinguishing between all different TR types
require(RColorBrewer)
Loading required package: RColorBrewer
tr.th <- 100 # arbitrary threshold for distinguishing "short" from "long" TRs
tr.col <- character(nrow(repeats)) # create a color vector
short.rp <- which(repeats$length < tr.th) # identify the short repeats indices
long.rp <- setdiff(1:nrow(repeats), short.rp) # the long repeats indices
set.seed(123) # for reproducibly shuffling the order before assigning the colors
short.rp <- sample(short.rp) # shuffle the indices for short tandem repeats
tr.col[short.rp] <- colorRampPalette(brewer.pal(12, "Paired")[seq(3,11,by=2)])(length(short.rp)) # assign the short repeats a lower contrast color
set.seed(231) # for reproducibly shuffling the order before assigning the colors
long.rp <- sample(long.rp)
tr.col[long.rp] <- colorRampPalette(brewer.pal(12, "Paired")[seq(4,12,by=2)])(length(long.rp)) # assign the long repeats a higher contrast color
# -- desaturate the colors --
# https://stackoverflow.com/questions/26314701/r-reducing-colour-saturation-of-a-colour-palette
library(colorspace) ## hsv colorspace manipulations
## Function for desaturating colors by specified proportion
desat <- function(cols, sat=0.5) {
X <- diag(c(1, sat, 1)) %*% rgb2hsv(col2rgb(cols))
hsv(X[1,], X[2,], X[3,])
}
tr.col <- desat(tr.col, sat = 0.8)
# -- finish --
tr.col <- paste0(tr.col, "CC") # add 20% transparency to the TR features
names(tr.col) <- paste0("TR-", repeats$type) # name the colors by the TR types
repeats$color <- tr.col
Combine domains, SP and GPI-anchor with TR features.
# combine sequence features with tandem repeats
feature1 <- bind_rows(filter(feature, type != "Tandem Repeats"), tr) %>%
mutate(type = ordered(type, levels = c("Hyphal_reg_CWP", "SignalP", "GPI-anchor", unique(tr1$type))))
Error in `mutate()`:
! Problem while computing `type = ordered(...)`.
Caused by error in `h()`:
! error in evaluating the argument 'x' in selecting a method for function 'unique': object 'tr1' not found
Backtrace:
1. ... %>% ...
10. base::.handleSimpleError(`<fn>`, "object 'tr1' not found", base::quote(unique(tr1$type)))
11. base h(simpleError(msg, call))
# plot
p1 <- geom_segment(data = feature1, aes(color = type, text = tip), size = h)
p2 <- geom_segment(data = tandem.div, aes(x = name, xend = name, y = div, yend = div + 1.5),
size = h, color = "white")
p3 <- p0 + p1 + coord_flip() + theme_classic() +
scale_color_manual(values = feature.col1) +
scale_y_continuous(expand = expansion(mult = c(0.02, 0.02))) +
theme(axis.text.y = element_text(size = 3), axis.title.y = element_blank(),
axis.line.y = element_blank(), axis.ticks.y = element_blank(),
axis.line.x = element_blank(),# axis.ticks.x = element_blank(),
legend.position = "none", plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = alpha("lightblue",0.1))) +
labs(y = "Position in sequence", x = "Sequences", color = "FEATURES")
# plot_grid(p.gtree, p3 + p2, ncol = 2, align = 'v', rel_widths = c(1,2), scale = c(1.01,1))
ggsave("../output/img/20220916-domain-tandem-repeats-distinct-color.png", plot = p3, width = 5, height = 7.5)
# plot
#require(plotly)
plotly::ggplotly(p3, tooltip = "text", width = 900, height = 900)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
# reorder the sequences for plotting
# plot
#p1 <- ggplot(seqLen, aes(x = name, y = start, xend = name, yend = end)) + geom_segment(color = "gray80", size = 2)
p1 <- geom_segment(data = filter(pf11765, !name %in% Mb.rm$name),
aes(x = name, xend = name, y = start+1, yend = end), size = h, color = "#3d84c6")
p2 <- geom_segment(data = filter(tango, !name %in% Mb.rm$name),
aes(x = name, xend = name, y = ifelse(start-4 >= 0, start-4, 0), yend = end + 4, color = median), size = h)
p3 <- p0 + p1 + p2 + coord_flip() + theme_classic() +
scale_color_viridis_c(limits = c(5,101), breaks = c(5,seq(25,100,25)), direction = -1) +
theme(axis.text.y = element_text(size = 4), axis.title.y = element_blank(),
axis.line.y = element_blank(), axis.ticks.y = element_blank(),
axis.line.x = element_blank(), axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.position = c(0.8,0.88),
legend.title = element_text(size = 14),
legend.text = element_text(size = 12),
panel.background = element_rect(fill = alpha("lightblue",0.1))) +
ylim(-2, 4500) + labs(y = "Position in sequence", x = "Sequences", color = "TANGO score")
p3
##plot_grid(p.gtree, p4, ncol = 2, align = 'v', rel_widths = c(1,2.5), scale = c(1.01,1))
ggsave("../output/img/20220916-tango-score-segment.png", plot = p3, width = 6, height = 7.5)
# plot the length distribution of the homologs in each species
ggplot(blastInfo, aes(x = species, y = slen)) +
geom_boxplot(outlier.shape = NA) + geom_hline(yintercept = 500, linetype = 2) +
#stat_summary(fun.data = "mean_cl_boot", color = "red") +
geom_dotplot(binwidth = 100, binaxis = "y", stackdir = "center", dotsize = 0.5,
binpositions = "all", fill = rgb(0,0,0,0.5), color = rgb(0,0,0,0.6)) +
scale_x_discrete(limits = rev) +
coord_flip() + labs(title = "Distribution of XP_028889033 homologs' length") +
theme(axis.title.y = element_blank(), axis.text.y = element_blank())
plot_grid(p.tree, p, align = "h", ncol = 2, rel_widths = c(1,1.5), scale = c(1.08,1))
Fig. 5 Distribution of Hil family homologs’ length across species. The shadings in the species tree on the left is the same as in Figure 1. Protein lengths are plotted both as a dotplot (binned in 100 a.a. bins) and a boxplot, where the box represents the interquartile range, the middle bar indicates the median and the whisker 1.5 times the interquartile range.
Discussion
Below is the relationship between the species studied in this analysis.
sps.tree <- read.tree(file = "data/20200724-species-tree.nwk")
p.tree <- sps.tree %>%
as_tibble() %>%
mutate(label = paste0(str_sub(label, 1, 1), ". ", str_sub(label, 2))) %>%
left_join(spsData, by = "label") %>%
as.treedata() %>%
ggtree(ladderize = FALSE) + xlim(0,17) + scale_y_reverse() +
geom_tiplab(aes(color = patho), size = 3.8, fontface = "italic", offset = 0.3) +
scale_color_manual(values = c("black","red"), guide = "none") +
#geom_hilight(node = 27, fill = "magenta", alpha = 0.15) + # Clavispora
geom_hilight(node = 23, fill = "#7F00FF", alpha = 0.15) + # MDR
#geom_hilight(node = 33, fill = "gold", alpha = 0.25) + # Candida
geom_hilight(node = 29, fill = "pink", alpha = 0.25) + # albicans
#geom_hilight(node = 37, fill = "grey50", alpha = 0.15) + # Sacchromycetaceae
geom_hilight(node = 33, fill = "steelblue", alpha = 0.15) # glabrata
#geom_cladelabel(node = 22, label = "Clavispora", offset = 3.7, color = "magenta",
# offset.text = 0.2, angle = 270, hjust = .5, extend = 0.3) + # Clavispora
p.tree +
geom_cladelabel(node = 23, label = "MDR", offset = 4.5, color = "purple", fontface = 2,
offset.text = 0.2, angle = 270, hjust = .5, extend = 0.3) + # MDR
#geom_cladelabel(node = 27, label = "Candida", offset = 2.7, color = "salmon",
# offset.text = 0.2, angle = 270, hjust = .5, extend = 0.3) + # Candida
geom_cladelabel(node = 29, label = "albicans", offset = 3.5, color = "hotpink2", fontface = 2,
offset.text = 0.2, angle = 270, hjust = .5, extend = 0.3) + # albicans
geom_cladelabel(node = 33, label = "glabrata", offset = 3.5, color = "steelblue",
offset.text = 0.2, angle = 270, hjust = .5, extend = 0.3) # Sacchromycetaceae
Fig. 1 Phylogenetic relationship of the species analyzed in this study. This species tree is based on the Maximum likelihood phylogeny from Muñoz et al. 2018 (PMID: 30559369) for the most part, with the exception of the species in the Saccharomycetaceae family, which is based on Gabaldón et al. 2016 (PMID: 27493146).
Note that he placement of Debaryomyces hansenii differs between the above two publications. The former placed it closer to the Clavispora genus while the latter placed it closer to the Candida genus. We based ours on the first publication. Another large scale phylogenetic analysis (Shen et al. 2018 (PMID: 30415838)), also placed D. hansenii closer to the Candida genus. However, in both studies, which reported bootstrap support values in addition to the phylogeny, the support for the part of the tree involving D. hansenii is relatively low suggesting uncertainty in resolving the relationship. We chose to place D. hansenii closer to the Clavispora genus because this is more congruent with the gene tree for the Hil family proteins we inferred below.
Species tree for the side of a figure
#p.tree <- sps.tree %>% as_tibble() %>%
# mutate(label = paste0(str_sub(label, 1, 1), ". ", str_sub(label, 2))) %>%
# as.treedata() %>% ggtree(ladderize = FALSE) + xlim(0,15) + scale_y_reverse() +
# geom_tiplab(size = 5, align = TRUE, linesize = .5, fontface = "italic", offset = 0.1) +
# geom_hilight(node = 22, fill = "hotpink", alpha = 0.2) + # Clavispora
# geom_hilight(node = 23, fill = "purple", alpha = 0.2) + # MDR
# geom_hilight(node = 27, fill = "hotpink", alpha = 0.2) + # Candida
# geom_hilight(node = 29, fill = "red", alpha = 0.2) + # albicans
# geom_hilight(node = 31, fill = "grey20", alpha = 0.2) # Sacchromycetaceae
ggsave("output/img/20210624-species-tree-side.png", p.tree, width = 3, height = 4.5)
The species tree above only includes species with at least one homolog of the Hil family and thus leaves out species that are usually included in the yeast phylogeny. Here I add some of them back to achieve two goals: 1) to help make the point that Candida pathogens have evolved independently multiple times, by adding non-pathogenic species around those that are pathogenic; 2) to enhance the point that the Hil family has experienced losses in the non-pathogenic species.
sps.tree.exp <- read.tree(file = "data/20211017-expanded-species-tree-for-fig1.nwk")
spsData <- read_tsv("data/20211024-extended-species-Als-Hil-homologs-number-table.txt")
# correct names
sps.tree.exp <- sps.tree.exp %>%
as_tibble() %>%
mutate(label = gsub("\\.", "\\. ", label)) %>%
left_join(spsData, by = "label") %>%
as.treedata()
# plotting
off = 4.5
p <- ggtree(sps.tree.exp, ladderize = FALSE) + xlim(0,16) + scale_y_reverse() +
geom_tiplab(aes(color = patho), size = 3.5, fontface = "italic", offset = 0.3) +
scale_color_manual(values = c("black","red"), guide = FALSE) +
#geom_hilight(node = 27, fill = "magenta", alpha = 0.15) + # Clavispora
geom_hilight(node = 28, fill = "#7F00FF", alpha = 0.15) + # MDR
#geom_hilight(node = 33, fill = "gold", alpha = 0.25) + # Candida
geom_hilight(node = 35, fill = "pink", alpha = 0.25) + # albicans
#geom_hilight(node = 37, fill = "grey50", alpha = 0.15) + # Sacchromycetaceae
geom_hilight(node = 41, fill = "steelblue", alpha = 0.15) + # glabrata
#geom_cladelabel(node = 27, label = "Clavispora", offset = off+1.2, color = "orchid3",
# offset.text = 0.3, angle = 270, hjust = .5, extend = 0.3) + # Clavispora
geom_cladelabel(node = 28, label = "MDR", offset = off+1, color = "#7F00FF", fontface = 2,
offset.text = 0.3, angle = 270, hjust = .5, extend = 0.3) + # MDR
geom_cladelabel(node = 33, label = "Candida", offset = off, color = "orange2", barsize = 0.2,
offset.text = 0.3, angle = 270, hjust = .5, extend = 0.3) + # Candida
geom_cladelabel(node = 35, label = "albicans", offset = off+1, color = "hotpink2", fontface = 2,
offset.text = 0.3, angle = 270, hjust = .5, extend = 0.3) + # albicans
geom_cladelabel(node = 37, label = "Saccharomycetaceae", offset = off, color = "grey50",
barsize = 0.2,
offset.text = 0.3, angle = 270, hjust = .5, extend = 0.3) + # Sacchromycetaceae
geom_cladelabel(node = 41, label = "glabrata", offset = off+1, color = "steelblue", fontface = 2,
offset.text = 0.3, angle = 270, hjust = .5, extend = 0.3) # glabrata
p
ggsave("output/img/20211023-species-tree-multiple-origin.png", width = 5, height = 5)
Also visualize the number of homologs in the Als and Hil families
df0 <- select(spsData, label, Als, Hil) %>%
pivot_longer(Hil:Als, names_to = "family", values_to = "number") %>%
mutate(species = factor(label, levels = rev(spsData$label)))
p <- ggplot(df0, aes(x = family, y = species)) +
geom_tile(aes(fill = number), color = "white", alpha = 0.4) +
geom_text(aes(label = number), color = "black") +
scale_fill_distiller(palette = "Greys",direction = 1) +
scale_x_discrete(position = "top") +
theme_cowplot() + theme(axis.title = element_blank(), legend.position = "none")
p
ggsave("output/img/20211024-extended-species-Als-Hil-family-size-plot.png", width = 3, height = 5)
gene.tree <- read.raxml("data/RAxML_bipartitionsBranchLabels.clustalo_3701250") %>%
root(node = 143, resolve.root = TRUE, edgelabel = TRUE)
geneNameConvert <- read_tsv("data/20211124-cauris-Hil-gene-name-convert.txt", col_types = "cc")
df0 <- seqInfo %>% select(name, species_id, species_gr) %>% left_join(geneNameConvert, by = "name")
gene.tree <- full_join(gene.tree, df0, by = "label")
sps.color <- c("MDR clade" = "#7F00FF9F",
"C. lusitaniae" = "plum2",
"D. hansenii" = "gray50",
"C. parapsilosis" = "orange2",
"S. stipitis" = "slategray",
"albicans clade" = "hotpink",
"Saccharomycetaceae" = "grey20")
p <- ggtree(gene.tree, aes(alpha = bootstrap), ladderize = TRUE) +
geom_nodelab(aes(x = branch, label = bootstrap, subset = bootstrap < 80),
alpha = 1, vjust = -.5, size = 2) +
scale_alpha_continuous(name = "Rapid bootstrap", breaks = c(20,60,100)) +
geom_tippoint(aes(color = species_gr)) +
scale_color_manual(name = "Clades / Species", values = sps.color) +
theme(text = element_text(size = 10))
p <- flip(p, 137,149)
p
ggsave("output/img/20210624-raxml-gene-tree-color.png", width = 7, height = 5)
Fig. 2 RAxML inferred gene tree for Hyr/Iff-Like (HIL) family members in Ascomycetes. The branch length is proportional to the inferred substitions per site. The tree is manually rooted on the Saccharomycetaceae family, which is the outgroup to both the Candida and Clavispora genera and whose Hil homologs form a distinct group. Protein sequence names are not shown for brevity, but are color coded based on the species (groups) they belong to. The clade color code is nested such that if a sequence belongs to both a species and a clade, e.g. C. auris and the MDR clade, the sequence will be colored based on the smaller phylogenetic unit, i.e. C. auris.
We reconciled the gene tree with the species tree in Notung 2.9. The purpose of this step is to reconstruct the history of gene duplications and losses (transfers don’t apply to this case) and to rearrange weakly supported parts of the gene tree to reduce the total event score in accordance with the evolutionary parsimony principal.
Briefly, after both the gene tree and the rooted species tree (see above) were loaded into the graphic user interface, the rooting analysis was run and the branch receiving the highest root score was the same as the manually chosen branch as shown in Fig. 2, namely the ancestral branch for all the Saccharomycetaceae proteins. After rooting and reconciliation, a total of 59 duplications and 47 losses were inferred. The reconciled gene tree was then rearranged. In this step, weakly supported branches – those with rapid bootstrap score < 90% – were allowed to be swapped and the event score, calculated as a weighted sum of the total loss and duplication events, was calculated for each rearrangement. Among all rearranged gene tree topologies, the one with the lowest event score had 42 duplications and 12 losses. This is the gene tree that is shown below.
gene.tree.rec <- read.nhx(file = "output/notung/RAxML_bipartitions_clustalo_3701250_rearrange80.nhx")
gene.tree.rec <- full_join(gene.tree.rec, df0, by = "label")
ggtree(gene.tree.rec, ladderize = FALSE, branch.length = "none") + xlim(0,20) + scale_y_reverse() +
geom_label(aes(x = branch, label = ifelse(S %in% c("Saccharomycetaceae", "CUG-Ser1", "DH", "SS") & node != 106, S, NA)), fill = "gray", size = 3) +
geom_tiplab(size = 1.5, offset = 0.2) +
geom_nodepoint(aes(fill = D), shape = 21, color = alpha("grey10", 0)) + scale_fill_manual(values = c("Y" = "red2", "N" = alpha("grey10", 0))) +
#geom_text(aes(label = ifelse(D == "Y", "D", NA)), hjust = -0.4, size = 2, color = "red") +
geom_tippoint(aes(color = species_gr)) +
scale_color_manual(name = "Clades", values = sps.color) +
theme(text = element_text(size = 14))
ggsave("output/img/20211124-reconciled-rearranged80-gene-tree-color.png", width = 7.5, height = 7)
Fig. 3 Reconciled and rearranged gene tree for Hyr/Iff-Like (HIL) family members in Ascomycetes. The cladogram shows only the topology of the tree, with endpoints colored in the same way as in Fig. 2. A red “D” next to an internal node indicates an inferred gene duplication event at that node. The labels with gray background highlight the main features of the tree: 1) the Saccharomycetaceae sequences form the outgroup, suggesting there was no ancient duplication prior to the divergence of the family and the remaining species; 2) the CUG-Ser1 clade, which contains both the Candida and Clavispora genera, forms two duplicate groups, suggesting an early duplication event in the clade; 3) the top CUG-Ser1 branch further experienced extensive duplications independently in the Clavispora genus, labeled by the outgroup D. hansenii (DH), and the Candida genus, labeled by the outgroup S. stipitis (SS).
p.gtree <- ggtree(gene.tree.rec, ladderize = FALSE, branch.length = "none") +
scale_y_reverse() +
geom_label(aes(x = branch, label = ifelse(S %in% c("Saccharomycetaceae", "CUG-Ser1", "DH", "SS") & node != 106, S, NA)), fill = "gray", size = 3) +
geom_tippoint(aes(color = species_gr), show.legend = FALSE) +
scale_color_manual(name = "Clades", values = sps.color) +
theme(text = element_text(size = 14))
ggsave("output/img/20210626-reconciled-gene-tree-side.png", width = 3, height = 6)
export the gene tree order for plotting domain feature maps
gene.tree.rec %>% as_tibble() %>% pull(name) %>% head(104) %>%
cat(file = "data/reorder_by_gene_tree.txt", sep = "\n")
# read in notung parsed event summary stats
notung.stat <- read_tsv("output/notung/RAxML_bipartitions.clustalo_3701250_rearrange80_event_summary.txt", col_types = 'cii')
sps.tree %>%
full_join(notung.stat, by = "label") %>%
as_tibble() %>%
mutate(label = paste0(str_sub(label, 1, 1), ". ", str_sub(label, 2)),
patho = c(T,T,T,T,T,F,T,F,T,T,T,F,F,F,T,T,F,T,rep(NA, 35-18))) %>%
as.treedata() %>%
ggtree(ladderize = FALSE) + scale_y_reverse() +
geom_tiplab(aes(color = patho), size = 3.5, fontface = "italic", offset = 0.3) + xlim(0,11) +
scale_color_manual(values = c("black","red"), guide = "none") +
#geom_tiplab(size = 3.5, fontface = "italic", offset = 0.1) +
#geom_hilight(node = 22, fill = "magenta", alpha = 0.15) + # Clavispora
geom_hilight(node = 23, fill = "purple", alpha = 0.15) + # MDR
#geom_hilight(node = 27, fill = "gold", alpha = 0.15) + # Candida
geom_hilight(node = 29, fill = "pink", alpha = 0.25) + # albicans
geom_hilight(node = 33, fill = "steelblue", alpha = 0.15) + # glabrata
geom_text(aes(label = duplications), hjust = 3, vjust = -.3,
size = 3, color = "red", fontface = 2) +
geom_text(aes(label = paste0("/", losses)), hjust = 1.3, vjust = -.3,
size = 3, color = "grey20", fontface = 2) +
geom_label(aes(label = ifelse(duplications >= 3, duplications, NA)), hjust = 2.1,
vjust = -.11, size = 3, color = "red", fill = "yellow1", fontface = 2,
label.size = 0, label.padding = unit(0.12, "lines"))
ggsave("output/img/20210622-species-tree-with-gains-losses.png", width = 4.5, height = 5)
Fig. 4 Inferred gene duplications and losses in the Hyr/Iff-Like (HIL) family in Ascomycetes. The cladogram shows the species relationship with shadings as in Figure 1. The numbers on top of each branch are the inferred duplications (red) and losses (black) using Notung 2.9. Yellow highlight emphasize the branches that experience three or more duplications.
Conclusions
While I was performing BLAST on FungiDB to identify homologs of our protein, I noticed that many of the hits appear to be at the beginning and end of the chromosomes. To see if there is a systematic trend in the chromosomal locations, I collected this information for all 99 homologs in the list. Note however, not all genomes are assembled to the chromosomal level, and as a result, the locations for the proteins in those species/strains would be relative to a contig or scaffold. My rationale is that if the contig or scaffold is close to choromosomal length (although that varies by at least an order of magnitude), I could at least look at them separately.
#chrLoc <- read_tsv("data/XP_028889033_homologs_chr_loc.tsv", col_types = cols()) %>%
# mutate(id = ifelse(is.na(accession), gene_id, accession),
# accession = NULL,
# species_id = factor(species_id, levels = sps.list),
# relLoc = round(chrstart/chrL, 3))
#seqInfo <- seqInfo %>%
# left_join(chrLoc) %>%
# mutate(assemblystatus = factor(assemblystatus, levels = c("Complete Genome","Chromosome","Contig","Scaffold"),
# labels = c("Chromosome", "Chromosome", "Partial", "Partial")),
# species_gr = factor(species_id, labels =
# c(rep("MDR clade",4),"C. lusitaniae","D. hansenii",rep("C. parapsilosis",2),
# rep("albicans clade",3), "S. stipitis", rep("Saccharomycetaceae",6)))) %>%
# select(name, id, source, starts_with("gene"), starts_with("species"), strain, starts_with("assembly"),
# length, n_seqs, starts_with("chr"), relLoc)
p <- seqInfo %>%
mutate(chr = paste(substr(species_id,1,4), chraccver, sep = "_"),
chr = reorder(chr, chrL, mean)) %>%
ggplot(aes(x = chr)) +
geom_segment(aes(xend = chr, y = 1, yend = chrL, color = assemblystatus), size = 2) +
geom_segment(aes(xend = chr, y = chrstart, yend = chrstart+7000), color = "red", size = 3)
p + coord_flip() + theme_classic() + scale_color_manual(values = c("grey50", "grey")) +
theme(axis.text.y = element_text(size = 5),
axis.line.y = element_blank(), axis.ticks.y = element_blank(),
axis.line.x = element_blank(), axis.ticks.x = element_blank(),
legend.position = c(0.8,0.2),
panel.background = element_rect(fill = alpha("lightblue",0.1))) +
labs(y = "Position", x = "Chromosomes / Scaffolds", color = "Assembly Status")
ggsave("output/img/20210213-homologs-chr-loc.png", bg = "transparent", width = 5, height = 5)
p <- seqInfo %>% filter(!(chrL < 1e6 & assemblystatus == "Partial")) %>%
mutate(dTip = ifelse(relLoc < 0.5, relLoc, 1-relLoc)) %>%
ggplot(aes(x = dTip)) + facet_wrap(~assemblystatus) +
xlab("Distance from Chromosome or Scaffold Ends") + labs(fill = "Species group") +
scale_x_continuous(breaks = seq(0,0.5,by = 0.1))
p + geom_histogram(binwidth = 0.05)
ggsave("output/img/20210213-distribution-on-chromosome-or-scaffolds-histogram-grey.png", width = 5, height = 3)
p + geom_histogram(aes(fill = species_gr), binwidth = 0.05) + scale_fill_brewer(palette = "Paired")
ggsave("output/img/20210213-distribution-on-chromosome-or-scaffolds-histogram.png", width = 6, height = 3)
Discussion
A recent long-read sequencing study for C. glabrata annotated 31 novel ORFs, of which 24 are GPI-Cell Wall Proteins. The authors cited previous literature supporting the in the subtelomeric regions Xu 2020. This plus the empirical observation above showing an apparent enrichment towards the chromosome ends lead me to ask whether there is indeed a significant enrichment among this family of proteins in the subtelomeric regions.
To formally test this hypothesis, I need to account for the background gene density differences along the chromosomes. The idea is to compare the chromosomal positions for this group of proteins compared with the gene densities on the chromosomes they reside on. We will restrict our analysis to those genomes with a chromosomal level assembly status for obvious reasons.
Update 2021-06-21 We will also include C. auris B11221 as we know its assembly is nearly at the chromosomal level
use.sps <- seqInfo %>%
filter(assemblystatus == "Chromosome" | species_id == "Cauris") %>%
filter(!species_id %in% c("Cdubliniensis", "Ncastellii")) %>%
# phylogenetic relatedness can induce correlations, including when testing for enrichment at the chromosomal ends
# based on the gene duplication patterns, combined with YGOB synteny based orthology assignments, the set of species
# used here, after excluding Cdubliniensis, represent quasi phylogenetically independent contrasts, thanks to the
# many species-specific duplications
select(species_id, assembly) %>% unique()
use.sps
# now let's also create a new tibble for our homologs from these species
fg.freq <- chrLoc %>%
filter(species_id %in% use.sps$species_id) %>%
# remove the one C. auris gene located on an unassembled fragment
filter(!(species_id == "Cauris" & chr_name == "scaffold00015")) %>%
# change the scaffold name to chromosome name (they correspond)
mutate(chr_name = gsub("scaffold0+", "", chr_name))
To conduct this test, we first need to prepare and compute the background gene densities. To do this, we will gather the genome assembly files for the selected species, read them into R, and generate a table that contains one row for each gene, with its gene ID, chromosome number, chromosome length and the start position expressed as a percentage measured from the chromosome ends.
# 1. prepare file names
# get all file names that ends with "feature_table.txt.gz", which contain the gene annotation
feature.files <- list.files(path = "data/assembly-info/", pattern = "*feature_table.txt.gz$")
names(feature.files) <- sapply(str_split(feature.files, pattern = "_"), function(x) {
paste(x[1], x[2], sep = "_")
})
# get all file names that ends with "assembly_report.txt", which contain the chromosomal length
assembly.files <- list.files(path = "data/assembly-info/", pattern = "*assembly_report.txt$")
names(assembly.files) <- sapply(str_split(assembly.files, pattern = "_"), function(x) {
paste(x[1], x[2], sep = "_")
})
use.sps$FeatureFile <- feature.files[use.sps$assembly]
use.sps$AssemblyFile <- assembly.files[use.sps$assembly]
# 2. read in the assembly information
feature.col.names <- c("feature","class","assembly","assembly_unit","seq_type","chromosome","genomic_accession","start","end","strand","product_accession","non_redundant_refseq","related_accession","name","symbol","GeneID","locus_tag","feature_interval_length","product_length","attributes")
assembly.col.names <- c("chromosome","seq_role","assign_molecule","type","gb_acc","relationship",
"chraccver","assembly_unit","seqL","ucsc_name")
compute.bg.freq <- function(row){
assembly.file <- row["AssemblyFile"]
assembly <- read_tsv(paste0("data/assembly-info/",assembly.file), comment = "#",
col_names = assembly.col.names, col_types = "ccccccccic")
feature.file <- row["FeatureFile"]
feature <- read_tsv(paste0("data/assembly-info/",feature.file), col_names = feature.col.names,
col_types = "ccccccciicccccccciic", skip = 1)
res <- feature %>%
filter(feature == "mRNA") %>%
# these feature tables are organized hierarchically, with the top level being "gene"
# the next level one of "mRNA", "ncRNA", "tRNA" or "rRNA". we only count protein-coding genes, i.e.
# "mRNA". the reason I didn't select the "CDS" feature type is because in a small number of cases,
# one mRNA feature contains more than one CDS feature, possibly due to splicing or alternative
# translational start site
select(chromosome, start, end) %>%
left_join(select(assembly, chromosome, chraccver, seqL), by = c("chromosome" = "chromosome")) %>%
mutate(relLoc = round(start / seqL, 3))
return(res)
}
# apply the function to the genomes, but leave out C. auris
bg.freq <- apply(filter(use.sps, species_id != "Cauris"), MARGIN = 1, compute.bg.freq)
names(bg.freq) <- use.sps %>% filter(species_id != "Cauris") %>% pull(species_id)
Separately calculate the background frequency for C. auris
compute.bg.freq.cau <- function(){
assembly.col.names <- c("chromosome","seq_role","assign_molecule","type","gb_acc","relationship",
"chraccver","assembly_unit","seqL","ucsc_name")
row = use.sps %>% filter(species_id == "Cauris") # C. auris entry
assembly.file <- row["AssemblyFile"]
assembly <- read_tsv(paste0("data/assembly-info/",assembly.file), comment = "#",
col_names = assembly.col.names, col_types = "ccccccccic") %>%
filter(as.integer(str_sub(chromosome, -2, -1)) <= 7)
feature.file <- row["FeatureFile"]
feature <- read_tsv(paste0("data/assembly-info/",feature.file), col_names = feature.col.names,
col_types = "ccccccciicccccccciic", skip = 1)
res <- feature %>%
filter(feature == "mRNA") %>%
select(chraccver = genomic_accession, start, end) %>%
right_join(select(assembly, chromosome, chraccver, seqL), by = "chraccver") %>%
mutate(relLoc = round(start / seqL, 3),
chromosome = gsub("scaffold0+","",chromosome))
return(res)
}
bg.freq$Cauris <- compute.bg.freq.cau()
Let’s take a look at the gene density in one of the genomes, e.g. D. hansenii
bg.freq$Dhansenii %>% ggplot(aes(x = relLoc)) +
geom_density() + facet_wrap(~ chromosome) + ggtitle("D. hansenii") + theme(plot.title = element_text(hjust = 0.5))
We can see that there is typically a drop in gene density towards the chromosome ends, which makes the observation that our protein family are biased towards the chromosomal ends even more striking.
_Note that the above plot was wrong due to the way the density() function estimates the kernel density, which assumes that the data outside the range are possible, in this case smaller than 0 and greater than 1. This leads to the underestimation of the density at the boundaries because it includes regions outside the possible range, where there is no observation. See this post for discussion https://github.com/tidyverse/ggplot2/issues/3387_
Below I use the geom_histogram function instead, with
breaks specified manually. This results in a density distribution that
is pretty flat across the chromosomes.
bg.freq$Dhansenii %>% ggplot(aes(x = relLoc)) +
geom_histogram(breaks = seq(0,1,0.05)) +
facet_wrap(~ chromosome) + ggtitle("D. hansenii") + theme(plot.title = element_text(hjust = 0.5))
ggsave("output/img/20210303-dhansenii-chromosome-gene-density-histogram.png", width = 6, height = 4)
In order to compare the distribution of all genes in
different bins of the chromosomes to the homologs in our case study, we
can divide each chromosome in the nrow(use.sps) genomes
into an arbitrary number of bins after “folding” them in half,
e.g. 0-10%, 10-20%, 20-30%, 30-40% and 40-50%. To be able to visually
compare the distribution of our homologs and the genome background, we
will create a special “chromosome” class that will be our homologs and
combine them with the bg.freq table.
freq.bins <- c(-0.001, seq(0.1, 0.5, 0.1)); freq.binsL <- c(0, freq.bins[-1])
freq.label <- paste0(head(freq.binsL, -1)*100,"-",tail(freq.binsL, -1)*100,"%")
bg.freq.tb <- bind_rows(bg.freq, .id = "species") %>%
mutate(fold.relLoc = ifelse(relLoc <= 0.5, relLoc, 1-relLoc),
bin = cut(fold.relLoc, breaks = freq.bins, labels = freq.label))
# add the homologs
fg.freq <- fg.freq %>%
mutate(fold.relLoc = ifelse(relLoc <= 0.5, relLoc, 1-relLoc),
bin = cut(fold.relLoc, breaks = freq.bins, labels = freq.label))
freq.plot <- fg.freq %>%
mutate(chromosome = "X") %>% # we label the homologs as X to make it a separate class
select(species = species_id, chromosome, bin) %>%
bind_rows(select(bg.freq.tb, species, chromosome, bin))# %>%
#filter(!species %in% c("Ncastellii")) # remove one species to make it easier to plot
# plot
freq.plot %>%
mutate(species = paste0(substr(species, 1, 1), ". ", substr(species, 2, 15))) %>%
ggplot(aes(x = chromosome, group = bin, fill = bin)) +
geom_bar(position = position_fill()) +
facet_wrap(~ species, scales = "free_x") +
#scale_fill_brewer("Distance from\nchromosome end", type = "qual", palette = 3) +
scale_fill_viridis_d(direction = -1, end = 0.95, alpha = 0.9) +
scale_y_continuous(name = "Cumulative % of genes", trans = "reverse", breaks = seq(0,1,0.2)) +
theme_cowplot() + panel_border(color = "grey80") +
theme(legend.position = "none", strip.background = element_blank(),
strip.text.x = element_text(face = 3))
ggsave("output/img/20210303-compare-homologs-chromosomal-locations-to-bg.png", width = 6, height = 4)
In the plot above, “X” is a special category that collects our homologs. We can see that the distribution of these proteins on the chromosome deviate from the background. Note that a few of the seven species contain only 1-2 PF11765 homologs – these include the three in the middle row – and we should interpret their homologs distribution with caution.
Now that we have the background gene frequencies computed, we can start constructing a test that tests for significant departure in the chromosomal locations of our protein family from the background frequencies. One idea is to divide the chromosome into equal-sized bins and use the frequencies of all genes in each bin as the multinomial probabilities in the null hypothesis. If our proteins were randomly selected from each chromosome without regard to their location, we would expect their locations to conform to the background frequencies. This can be tested using an exact multinomial test or an approximate Chi-square test or G-test. Since we don’t distinguish the two ends of a chromosome, we can “fold” the chromosome “in half” and measure each gene’s location as a percentage from the closer end, i.e. the relative location ranging from 0%-50%. If we can reject the null hypothesis, that constitutes evidence that the proteins from our group are not randomly selected from the background set.
Result of multinomial test
# install XNomial package for the multinomial exact test function
# https://cran.r-project.org/web/packages/XNomial/vignettes/XNomial.html#e1
while(!require(XNomial))
suppressMessages(install.packages("XNomial"))
# calculate pooled background frequency
bg.cnt <- cut(bg.freq.tb$fold.relLoc, breaks = freq.bins, labels = freq.label) %>% tabulate()
fg.cnt <- tabulate(fg.freq$bin)
xmulti(obs = fg.cnt, expr = bg.cnt, detail = 3)
Note that the three P-values correspond to three approaches of testing the goodness-of-fit. The LLR, which stands for Log Likelihood Ratio, is generally preferred. But all three said the same thing: the observation deviates from the background frequency significantly. By looking at the plots above, it is clear that the deviation is due to the excess of our homologs residing in the 0-10% bin, which is the tip of the chromosomes.
The test above assumes that the gene density along all chromosomes in the seven species follow the same distribution. The empirical observation supports that hypothesis. Should that to be not the case, we could also account for the variability in gene density distribution between species or even between chromosomes within a species. The idea is to first collect the chromosomes that our homologs come from, and then randomly sample the same number of genes as the number of our homologs on that chromosome. Do this for all of the homolog-containing chromosomes, we would obtain one random sample. Repeat this process 1000 times or more, we will then get the pseudosample, which we can use to compare with our observations. Here we need to come up with a statistic that summarizes each of our observation or pseudosample. For example, we could calculate the median % location for each sample, and ask where our observation lie relative to the pseudosamples. I’ll skip this approach for now since the first appraoch appears to be OK given the similar gene density distribution across chromosomes and species.